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Abstract 

In the standard model of cosmology the universe starts with a hot Big Bang. As the universe expands it cools, 
and after 300,000 years it drops below the ionisation temperature of hydrogen. The previously free electrons 
become bound to protons, and with no electrons for the photons to scatter off they continue undeflected to us 
today. This image of the surface of last-scattering is what we call the Cosmic (because it fills the universe) 
Microwave (because of the frequency at which its black body spectrum peaks today) Background (because it 
originates behind all other light sources). Despite its stunning uniformity - isotropic to a few parts in a million - 
it is the tiny perturbations in the CMB that give us an unprecedented view of the early universe. First detected 
by the COBE satellite in 1991, these anisotropies are an imprint of the primordial density fluctuations needed 
to seed the development of gravitationally bound objects in the universe, and are potentially the most powerful 
discriminant between cosmological models. 

Realizing the extraordinary scientific potential of the CMB requires precise measurements of these tiny 
anisotropies over a significant fraction of the sky at very high resolution. The analysis of the resulting datasets 
is a serious computational challenge. Existing algorithms require terabytes of memory and hundreds of years 
of CPU time. We must therefore both maximize our resources by moving to supercomputers and minimize our 
requirements by algorithmic development. Here we will outline the nature of the challenge, present our current 
optimal algorithm, discuss its implementation - as the MADCAP software package - and its application to data 
from the North American test flight of the joint Italian-U.S. BOOMERanG experiment on the Cray T3E at 
NERSC and CINECA. 

A documented /3-release of MADCAP is publicly available at 

http: / /cfpa.berkeleyedu/^borrill/cmb/madcap.htinj 



1 Introduction 



The current standard model of cosmology starts with a hot Big Bang. Whilst there is still debate about what this 
actually means, it is generally agreed is that what emerges is a very hot, expanding, space-time — the Universe. 
As the Universe expands its temperature falls and about 300,000 years after the Big Bang it cools to below the 
ionisation temperature of hydrogen and the previously free electrons become bound to protons. With no electrons 
to scatter off, the photons propagate undeflected through space to the present. When we observe this radiation 
today we are seeing the universe as it was when it was 1/40, 000th of its present age. This image of the epoch 
when the primordial photons last scattered is what we call the Cosmic Microwave Background (CMB) radiation. 

Detected serendipitously in 1969, the CMB was considered the decisive argument in the debate of the day 
between the 'Steady State' and 'Big Bang' cosmologies. Today extraordinary instruments are measuring the 
tiniest variations in the CMB photons' temperature in different directions, and the results they are giving hold the 
promise of settling this generation's cosmological debates [|l|]. They have the potential to describe the geometry 
of space-time, showing whether straight lines continue forever or not; they can tell us how much mass and energy 
the universe contains, and the forms that it takes; and they may shed light the kinds of things that could have 
happened in the very first moments of the Big Bang. 

As observers began to search for fluctuations in the CMB temperature they soon discovered that it was 
astonishingly uniform - 2.735 Kelvin in every direction. Building ever more sensitive detectors, the first variations 
weren't discovered until the milliKelvin regime. However these were due our galaxy's motion through space, a 
Doppler effect making the universe appear hotter in the direction in which we are going and colder in the direction 
from which we have come. Finally in 1992 the COBE satellite team reported intrinsic spatial variation in the 
CMB's temperature of around ±30/i-fT when averaged over patches of sky approximately 10° across. 

Since that detection the focus has been on measuring the extent of the variation with ever greater precision on 
ever smaller angular scales. Both the absolute and the relative power of the fluctuations on different angular scales 
contain a wealth of cosmological information. For example, we expect to see a peak in the power at the angular 
scale corresponding to the horizon size (the limiting scale for coherent physical processes) at last scattering. The 
apparent angular size on the sky today of a particular physical scale in the past depends on the geometry of the 
Universe - decreasing in size as the curvature of space increases. A measurement of the location of the peak in the 
angular power spectrum therefore tells us whether the Universe is open, flat or closed. Similarly the height of this 
peak is related to the total energy density in the universe, and how it is distributed. The presence or absence of 
lower secondary peaks, at resonances of the fundamental scale, may allow us to rule out some classes of theories 
of the origins of the first density perturbations in the universe. 

Making an observation with small enough statistical and systematic error bars to resolve the detailed shape 
of this angular power spectrum requires very sensitive detectors scanning a significant fraction of the sky at very 
high resolution. These conditions are now being met for the first time in balloon-borne experiments such the 
joint Italian-U.S. BOOMERanG project. Lifted into the stratosphere to minimize atmospheric interference, this 
experiment measures changes in the voltage across an extremely sensitive bolometer, cooled to a few milliKelvin, 
as it scans the sky. Since the voltage across a bolometer depends on it's temperature, hidden within such a data 
set is a measurement of CMB temperature fluctuations. 

Reducing tens of millions of individual observations to an angular power spectrum of a thousand multipoles 
is a computationally challenging task. Each observation contains detector noise as well as signal on the sky; the 
signal on the sky includes not only the CMB but also foreground sources of microwave radiation such as interstellar 
dust; and both the noise and the signal components of nearby observations are correlated. At present, except in 
very restrictive circumstances, algorithms to solve the equations relating the time-ordered data to a map of the 
sky temperature, and then to the angular power spectrum, scale as the number of map pixels squared in memory 
and cubed in floating point operation count. The COBE map contained only a few thousand pixels, but current 
balloon observations are generating maps with tens and hundreds of thousands pixels, and the MAP and PLANCK 
satellites — to be launched in 2000 and 2007 respectively — will increase this to millions and tens of millions. 
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To realize the full potential of these CMB observations we simultaneously need to maximize our computational 
resources (by moving to supercomputers) and to minimize our computational requirements (by optimizing our 
algorithms and implementations). Here we present our current optimal algorithm and discuss its implementation 
in the MADCAP software package and describe its application at NERSC and CINECA to data from the North 
American test flight of BOOMERanG. For simplicity we will consider an observation from a single detector with 
no significant foreground contamination — a situation that was realized in practice in the analysis of the best 
channel of the BOOMERanG North America data. 

In what follows vectors and matrices are written in plain fonts in the time domain and italic fonts in the pixel 
domain. 



2 From The Time- Ordered Data To The Map 
2.1 Algorithm 

Our first step is to translate the observation from the temporal to the spatial domain — to make a map ||. 
Knowing where the detector was pointing, (9t,ipt), at each of the Aft observation, and dividing the sky into Af p 
pixels, we can construct an Aft x Af p pointing matrix A whose entries give the weight of pixel p in observation t. 
For a total power scanning experiment such as BOOMERanG this has a particularly simple form 

A = / 1 if ep (1) 

tp 1 otherwise 

while a more complex observing strategy would give a correspondingly complex structure. 
The time-ordered data vector can now be written 

d = As + n (2) 

in terms of the pixelised CMB signal s and time-stream noise n. Under the assumption of Gaussianity, the noise 
probability distribution is 

P(n) = (27r)~ M / 2 exp|-i (n T N" 1 n + Tr [InN]) J (3) 

where N is the time-time noise correlation matrix given by 

N=(nn T ) (4) 

We can now use equation (|2[) to substitute for the noise in equation (|3|), so the probability that, with a particular 
underlying CMB signal, we would have obtained the observed time-ordered data is 

P(d|s) = (2vr)- A, ' t / 2 exp|-i ((d - As) T r ! (d - A s) + Tr [InN]) J (5) 

Assuming that all CMB signals are a priori equally likely, this is proportional to the likelihood of the CMB signal 
given the time-ordered data. Maximizing over s now gives the maximum likelihood pixelized data (or map) d 

d = (a t N _1 a) _1 A T N _1 d (6) 

Substituting back for the time-ordered data in equation @ we recover the obvious fact that this pixelized data 
is the sum of the true CMB signal and some residual pixelized noise 

A T N" 1 A) ~* A T N" 1 (A s + n) 
s + n (7) 
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where this pixel noise 

n= (a t N -1 a) -1 A T N -1 n (8) 

has correlations given by 

N = (nn T ) 

= (a^^a)" 1 (9) 

The map-making algorithm can therefore be divided into two steps, 
Ml - construct the inverse pixel-pixel noise correlation matrix and noise-weighted map 

N- 1 = A T N _1 A 
z = N~ 1 d = A T N _1 d 

M2 - solve for the pixelized data 

d = (N- i r i z 

which are encoded in the two map-making modules — inv_pp_noise.c and p-data.c — in MADCAP. 
2.2 Implementation 

The first half of Table 1 shows the computational cost of a brute force implementation of each of the steps in 
the map-making algorithm (recall that multiplying an [a x b] matrix and a [b x c] matrix in general requires 

2 a be operations). For the BOOMERanG North America data — with Mt ~ 1.5 x 10 6 and M p ~ 2.4 x 10 4 - 
simply making the map would require 9 Tb disc space (storing data in 4- byte precision), 18 Tb RAM[] (doing all 
calculations in 8-byte precision) and 10 floating point operations. 

Calculation Brute Force Structure-Exploiting 
Disc RAM Flops Disc RAM Flops 

Ml 4 A/? 8N? 2NiNp 4(A^+M) 8(N%+AT t ) ZN T N t 
M2 4 A/; 2 8 A/" 2 (2 + |)AA p 3 4 A/; 2 8 A/" 2 (2 + §)A/" p 3 

Total 4 A/" 2 8 A/" 2 2A" 2 A" P 4(N*+N t ) 8(AA 2 +A;) (2 + |)A; 3 
Table 1: Computational requirements for the map-making algorithm 

Fortunately there are two crucial structural features to be exploited here. As noted above, for a simple scanning 
experiment like BOOMERanG the pointing matrix A is very sparse, with only a single 1 in each row. Moreover, 
the inverse time-time noise correlations are (by fiat) both stationary and fall to zero beyond some time-separation 
much shorter than the duration of the observation 

N-/ = f(\t-t'\) 

= V \t-t'\ > t <A/" t (10) 

so that the inverse time-time noise correlation matrix is symmetric and band-diagonal, with bandwidth Af T = 2r+l. 
The second half of Table 1 shows the impact of exploiting this structure on the cost of each step. The limiting 

Since all algorithms here will be operation-count limited we always assume in-core implementations. Out-of-core methods would 
reduce memory requirements but prohibitively increase the run-time overhead, for example in the cache hit cost of moving from level 

3 (matrix-matrix) to level 1 (vector- vector) BLAS. 
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step is now no longer constructing the inverse pixel-pixel noise correlation matrix but inverting it and solving for 
the map. Although this inversion is not necessary to obtain just the map, the current power spectrum algorithm 
requires the pixel-pixel noise correlation matrix itself and not simply its inverse. We therefore Cholesky decompose 
the positive definite matrix A r_1 , and use the decomposition both to solve for the map and to calculate N. For 
the same dataset making the map now takes 2.3 Gb of disc, 4.6 Gb of RAM, and 3.7 x 10 13 flops. 

Although the final analysis of the data is dominated by the second step, 'quick and dirty' systematics tests 
can be performed at much lower map resolution (ie. with much larger pixels) to reduce N p by up to an order of 
magnitude. At this point it becomes important to optimize the first step too. The structure exploiting algorithm 
reduces these calculations to: 

for each observation (at time t, of pixel p) 

for each observation within r of it (at time t', of pixel p') 
add f(\t-t'\) to N-) 
add f(\t — t'\) df to z p 

Dividing this into blocks of contiguous p-pixels allows each processor to work independently. However the number 
of times a pixel is observed can vary from a few to thousands, so a simple division into equal numbers of p-pixels on 
each processor can be very poorly load-balanced. The MADCAP implementation therefore starts by determining 
the number of operations required for each p-pixel and dividing them among the processors in blocks which are 
still contiguous but whose numbers of p-pixels vary so that the total operation count in each block is as close to 
the mean count per processor as possible. Although this means that the memory requirement per processor varies 
(and is not known exactly in advance) for the BOOMERanG North America data it reduced the run-time of these 
steps by a factor of two or more. 

3 From The Map To The Power Spectrum 
3.1 Algorithm 

We now want to move to a basis where the CMB observation can be compared with the predictions of various 
cosmological theories — the angular power spectrum. We decompose the CMB signal at each pixel in spherical 
harmonics 

s P = Yl a i™ Bi Yi m (6 p , tp p ) (11) 

lm 

where B is the pattern of the observation beam (assumed to be circularly symmetric) in /-space. The correlations 
between such signals then become 

Spp' = (spSpi) = X] X] ( a im a i'm') BiBi'Yi m (0p,ipp)Yi' m i(6 p ',ippi) (12) 

lm I'm' 

For isotropic fluctuations the correlations depend only on the angular separation 

{ai m ai'm') = Q 5 U ' 6 mm > (13) 

and the pixel-pixel signal correlation matrix entries become 

2/ + 1 

V = £-£- 5 J 2< 3^0fepO (14) 
I 

where Pi is the Legendre polynomial and Xpp' the angle between the pixel pair p,p'. These C/ multipole powers 
completely characterize a Gaussian CMB, and are an otherwise model-independent basis in which to compare 
theory with observations. The finite beam size means that any experiment has a maximum multipole sensitivity, 
above which all power is beam-smeared. Coupled with incomplete sky coverage, this means that in practice the Q 
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that we extract do not form a complete orthonormal basis. We therefore group the Mi accessible multipoles into 
Mb bins, adopting a fixed spectral shape function Cf and characterizing the CMB signal by its bin powers C& with 



C, = C, 



Cs 
... _.. I 



(15) 



Since the signal and noise are assumed to be realizations of independent Gaussian processes the pixel-pixel 
map correlations are 



D 



(dd T ) 

(ss T ) + (nn T ) 
S + N 



and the probability distribution of the map given a particular power spectrum C is now 

P{d\C) = {2Tr)- Ar p /2 exp 1-- (d T D- 1 d + Tr [InD]) j 



(16) 



(17) 



Assuming a uniform prior for the spectra, this is proportional to the likelihood of the power spectrum given the 
map. Maximizing this over C then gives us the required result, namely the most likely CMB power spectrum 
underlying the original observation d. 

Finding the maximum of the likelihood function of equation (|i"7|) is a much harder problem than making the 
map. Since there is no closed-form solution corresponding to equation (^) we must find both a fast way to evaluate 
the likelihood function at a point, and an efficient way to search the A/fe-dimensional parameter space for the peak. 
The fastest general method extant is to use Newton-Raphson iteration to find the zero of the derivative of the 
logarithm of the likelihood function ||. If the log likelihood function 



C(C) = -- (d T D^d+TrlhiD] 



(18) 



were quadratic, then starting from some initial guess at the maximum likelihood power spectrum C the correction 
5C that would take us to the true peak would simply be 





'd 2 c 


1 dc\ 


( 


dC 2 


dC J 



5C Q = 

Since the log likelihood function is not quadratic, we now take 

d = C + 5C 



c=c 



(19) 



(20) 



and iterate until 5C n ~ to the desired accuracy. Because any function is approximately quadratic near a peak, 
if we start searching sufficiently close to a peak this algorithm will converge to it. Of course there is no guarantee 
that it will be the global maximum, and in general there is no certainty about what 'sufficiently close' means in 
practice. However experience to date suggests that the log likelihood function is sufficiently strongly singly peaked 
to allow us to use this algorithm with some confidence. 

The core of the algorithm is then to calculate the first two derivatives of the log likelihood function with respect 
to the multipole bin powers 



dC 

dC b 



r ( d T D~ 



d 2 c 



dC h dC v 



-d T D~ 



1 dS 

da 

dS 



dC h 



D 



D~ x d- 
dS 



Tr 



zr 



dS 



dC b 



dCy 



D 



~ l d+ -Tr 
2 



D- 



dS 
dC h 



D- 



dS 



(21) 
(22) 
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Each iteration of the power-spectrum extraction algorithm can therefore be divided into six steps, 
PI Calculate the Mb pixel-pixel signal correlation bin derivative matrices 

dS _^ 2Z + 1 v2ris 



B i Cf p i(Xpp') 



P2 - Construct the pixel-pixel map correlation matrix, Cholesky decompose it, and triangular solve for the data- 
weighted map 

D = N + YC b — = LL T & LL v z = d 



P3 - Triangular solve the linear systems 



P4 - Assemble the first derivative 



LL W b 



OS 



dC 1 



dC b 2 



[d T W b z-Tr[W b ]) 



P5 - Assemble the second derivative 



d 2 C 1 



P6 - Calculate the spectral correction 



SC = - 



d 2 c 

OC 2 



dc 

dC 



which are encoded in the six power-spectrum extraction modules — ppsignal.c, L.c, trisolve.c, dLdC.c, d2LdC2.c 
and dC.c — in MADCAP. 



3.2 Implementation 

Table 2 shows the computational cost of each of the steps in each iteration of the power-spectrum algorithm. 
Obtaining the maximum likelihood power spectrum for the BOOMERanG North America data — with M p ~ 
2.4 x 10 4 , Mb ~ 10 and Mi ~ 1200 — requires 50 Gb of disc space, 9.2 Gb of RAM and 2.8 x 10 14 flops. 

Calculation Disc RAM Flops 

PI AM b M 2 

P2 A{M b + 2)M 2 

P3 A{2M b + l)M 2 
P4 AM b M 2 
P5 4M b M 2 
P6 4M b 2 

Total 4(2M b + 2)M 2 16 M 2 (2A4 + |)AA p 3 
Table 2: Computational requirements for each iteration of the power spectrum algorithm 

Note that in step PI we calculate the pixel-pixel signal correlation bin derivative matrices (which we need in 
step P3 anyway), rather than the full pixel-pixel signal correlation matrix. Since these are independent of the 



16M 2 0(MiM?) 
16 M 2 §a/; 3 
16 M 2 2M b Mp 
8M 2 2M b M 2 

16 M 2 3M b 2 M 2 

8Mt (2 + I) Mi 
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bin power step PI does not need to be repeated at each iteration. The trade-off here is the near doubling of disc 
space required, since we now want to keep these matrices throughout the calculation, and not overwrite them in 
step P3. If disc space is at a premium, however, we can simply revert to recalculation. In step P2 we calculate all 
the terms necessary to construct the log- likelihood itself dig) , which is therefore generated as a useful by-product, 
allowing us to confirm that the algorithm is moving to higher likelihood and to check the appropriateness of our 
convergence criterion. In practice over 80% of the run time is spent in the triangular solves of step P3. Since these 
involve level 3 BLAS only they are highly optimized, and we typically reach 40-80 % peak performance on the 
T3E both at CINECA and at NERSC, with the fraction slowly decreasing with the total number of processors 
used 

4 Conclusions 

MADCAP is a highly optimized, portable, parallel implementation (using ANSI C, MPI and the ScaLAPACK 
libraries) of the current optimal general algorithm for extracting the most useful cosmological information from 
total-power observations of the CMB. The /^-release of MADCAP has been ported to a wide range of parallel 
platforms — including the Cray T3E, SGI Origin 2000, HP Exemplar and IBM SP2. The combination of parallelism 
and algorithmic optimization has enabled CMB datasets that would previously have been intractable to be analyzed 
in a matter of hours. We have successfully applied MADCAP at CINECA and NERSC to the data from the North 
American Test flight of the BOOMERanG experiment; the scientific results will be published shortly 0|. 

Existing algorithms are capable of dealing with CMB datasets with up to 10 5 pixels. Over the next 10 years 
a ranee of observations are expected to produce datasets of 5 x 10 5 (BOOMERanG LDB), 10 6 (MAP) and 10 7 
(PLANCK) pixels. As shown in Table 3 (where we have assumed a constant 10 multipole bins and 5 Newton- 
Raphson iterations) the scaling with projected map size pushes the analysis of these observations well beyond the 
capacity of even the most powerful current supercomputers. 



Flight 




Disc 


RAM 


Flops 


BOOMERanG NA 


24,000 


50 Gb 


9 Gb 


1.4 x 10 15 


MAXIMA 1 


32,000 


90 Gb 


16 Gb 


3.2 x 10 15 


MAXIMA 2 


80,000 


560 Gb 


100 Gb 


5.1 x 10 16 


BOOMERanG LDB 


450,000 


18 Tb 


3 Tb 


3.7 x 10 18 


PLANCK 


20,000,000 


35 Pb 


2.4 Pb 


8 x 10 23 



Table 3: MADCAP computational requirements for current & forthcoming CMB observations 

Such datasets will require new algorithms. The limiting steps in the above analysis are associated with 
operations involving the pixel-pixel correlation matrices for the noise N, the signal S, and most particularly their 
sum D. The problem here is the noise and the signal have different natural bases. The inverse noise correlations 
are symmetric, band-diagonal and approximately circulant in the time domain, while the signal correlations are 
diagonal in the spherical harmonic domain. Moreover, thus far we have ignored foreground sources which will be 
spatially correlated and so most naturally expressed in the pixel domain. The fundamental problem is then that 
the data correlations — being the sum of these three terms — are complex in all three domains. Developing the 
approximations necessary to handle such data sets is an area of ongoing research. 
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